Abstract

We consider the problem of learning a measure of distance among vectors in a feature space and 
propose a hybrid method that simultaneously learns from similarity ratings assigned to pairs of vectors 
and class labels assigned to individual vectors. Our method is based on a generative model in which 
class labels can provide information that is not encoded in feature vectors yet relates to perceived
similarity between objects. Experiments with synthetic data as well as a real medical image retrieval
problem demonstrate that leveraging class labels through use of our method improves retrieval performance
significantly.
1 Introduction

Consider a retrieval system that, given features of an object, searches a database for similar objects. Such a 
system requires a distance metric for assessing similarity. One way to produce a distance metric is to learn 
from similarity ratings that representative users have assigned to pairs of objects. Given data of this kind, 
ratings can be regressed onto differences between object features. 

In this paper, we consider the use of class labels in addition to similarity ratings to learn a distance metric. 
Labels may be available, for example, if each object is assigned a class when entered into the database. The 
class label does not serve as an additional feature because when searching for objects similar to a new one, 
the class of the new object is usually unknown. In fact, the purpose of the retrieval system may be to supply 
similar objects and their class labels to assist the user in classifying the new object. However, class labels 
provide information useful to learning the distance metric because they may relate to similarity ratings in 
ways not captured by extracted features. 

While distance metric learning has attracted much attention in recent years, approaches that have been
proposed generally learn from either similarity/difference data or class labels but not both. We will refer to
these two types of approaches as similarity-based and class-based methods, respectively. In the former category
are multidimensional scaling methods (Cox and Cox), which embed vectors in a Euclidean space so
that distances between pairs are close to available estimates; ordinal regression (McCullagh and Nelder, 1989;
Herbrich et al., 2000), which learns a function that maps feature differences to discrete levels of measured
similarity; and convex optimization formulations (Xing et al., 2002; Schultz and Joachims, 2004; Frome et al.,
2006), which learn metrics that tend to make data pairs classified as similar close and others distant. As for
class-based methods, examples include relevant component analysis (Bar-Hillel et al., 2003), which aims to
learn a metric that makes data points that share a class close and others distant; neighborhood component
analysis (Goldberger et al., 2005), which learns a distance metric by optimizing the probability of correct
classification based on a softmax model and nearest neighbors; and the algorithms of Weinberger et al.
(2006), Weinberger and Tesauro (2007), and Weinberger and Saul (2009), which minimize the distances between
objects in each neighborhood that share the same class while separating those from different classes.

Our hybrid method of distance metric learning advances the aforementioned literature by providing an
effective algorithm that makes use of both kinds of data simultaneously. It consists of two stages: a soft 
classifier is learned from the class label data and then used together with the similarity rating data by any 
similarity-based distance metric learning algorithm. Although this method can make use of any algorithm 
for learning a soft classifier and any similarity-based distance metric learning algorithm, to best illustrate 
our idea we will focus on the combination of a kernel density estimation algorithm similar to neighborhood 
component analysis and the aforementioned convex optimization approach to learning from similarity ratings. 
Results from experiments with synthetic data as well as a real medical image retrieval problem demonstrate 
that this hybrid method improves retrieval performance significantly. 



2 Problem Formulation 
2.1 Data 

Suppose features of each object are encoded in a vector $x \in \mathbb{R}^K$. We are given a data set consisting of
similarity ratings for pairs of objects and class labels for individual objects. The ratings data consists
of a set $\mathcal{S}$ of quintuplets $(o, o', x, x', \sigma)$, each consisting of two object identifiers $o$ and $o'$, associated feature
vectors $x$ and $x'$, and a similarity rating $\sigma$. We assume that each similarity rating takes one of three values,
1, 2, or 3, conveying dissimilarity, neutrality, and similarity, respectively. Denote the number
of classes by $M$ and index each class by an integer from 1 through $M$. The class label data is a set $\mathcal{G}$ of
triplets $(o, x, c)$, each consisting of an object identifier $o$, a feature vector $x$, and a class $c \in \{1, 2, \ldots, M\}$.
Object identifiers are included in the data so that we know when a given class label is
associated with the same object as a given similarity rating. To compress notation, when the object
identifiers are not relevant to a discussion, we will refer to data samples in $\mathcal{S}$ as triplets $(x, x', \sigma)$ and data
in $\mathcal{G}$ as pairs $(x, c)$.



2.2 Distance Metric 



A distance metric is a mapping from $\mathbb{R}^K \times \mathbb{R}^K$ to $\mathbb{R}_+$ that assesses the distance between any given pair of objects.
Given a class of distance metrics $d_r : \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}_+$, parameterized by a vector $r$, we wish to
compute $r$ so that the resulting distance metric accurately reflects perceived distances. Though the methods
we present apply to a variety of distance metrics, much of our discussion will focus on the popular choice of
a weighted Euclidean norm:

$$d_r(x, x') = \sqrt{\sum_{k=1}^{K} r_k (x_k - x'_k)^2}. \qquad (1)$$
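As a minimal sketch of the weighted Euclidean norm in (1), the following Python function evaluates $d_r(x, x')$ for plain lists of numbers. The function name `weighted_distance` and the input checks are illustrative assumptions, not part of the method itself.

```python
import math


def weighted_distance(r, x, x_prime):
    """Weighted Euclidean norm: sqrt(sum_k r_k * (x_k - x'_k)^2).

    r, x, and x_prime are equal-length sequences of numbers; the
    coefficients r are assumed to be nonnegative.
    """
    if not (len(r) == len(x) == len(x_prime)):
        raise ValueError("r, x, and x_prime must all have length K")
    if any(rk < 0 for rk in r):
        raise ValueError("coefficients r must be nonnegative")
    return math.sqrt(sum(rk * (a - b) ** 2 for rk, a, b in zip(r, x, x_prime)))
```

Setting a coefficient $r_k$ to zero removes feature $k$ from the comparison entirely, which is why sparsity-inducing regularization on $r$ acts as feature selection.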



3 Algorithms 

Our goal is to learn a distance metric $d : \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}_+$ that helps us retrieve similar objects in the database.
We now discuss three existing algorithms for doing so and propose a new hybrid algorithm. 



3.1 Ordinal Regression 

Ordinal regression (McCullagh and Nelder, 1989) offers a simple approach to learning coefficients from the
similarity rating data $\mathcal{S}$. Ordinal regression typically assumes that, given a pair of objects $(x, x')$, similarity
ratings obey the conditional distribution

$$P(\sigma \le v \mid x, x') = \frac{1}{1 + \exp\left(-d_r(x, x')^2 - \theta_v\right)},$$

where $v \in \{1, 2, 3\}$ denotes the level of similarity, and $\theta_1 \le \theta_2$ are boundary parameters (we have implicitly
$\theta_3 = \infty$). These parameters, together with the coefficients $r$, are computed by solving a maximum likelihood
problem:

$$\begin{aligned}
\max_{r, \theta} \quad & \sum_{(x, x', \sigma) \in \mathcal{S}} \log P(\sigma \mid x, x') \\
\text{s.t.} \quad & r \ge 0, \\
& \theta_1 \le \theta_2.
\end{aligned}$$

Constraints are imposed on $r$ because, given the way our distance metric is defined in (1), coefficients of any
suitable distance metric should be nonnegative. Note that this algorithm only makes use of the rating data
$\mathcal{S}$.
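To make the ordinal regression model concrete, the sketch below converts the cumulative probabilities $P(\sigma \le v \mid x, x')$ into probabilities of the three individual ratings and sums the log-likelihood over rated pairs. It assumes the squared distances $d_r(x, x')^2$ have already been computed; the function names are hypothetical, and the numerically stable sigmoid is an implementation choice rather than something the model prescribes.

```python
import math


def cumulative_prob(d_squared, theta_v):
    """P(sigma <= v | x, x') = 1 / (1 + exp(-d_squared - theta_v))."""
    z = d_squared + theta_v
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow when z is very negative.
    e = math.exp(z)
    return e / (1.0 + e)


def rating_probs(d_squared, theta1, theta2):
    """Return [P(sigma=1), P(sigma=2), P(sigma=3)], using theta3 = infinity."""
    c1 = cumulative_prob(d_squared, theta1)
    c2 = cumulative_prob(d_squared, theta2)
    return [c1, c2 - c1, 1.0 - c2]


def log_likelihood(samples, theta1, theta2):
    """Sum of log P(sigma | x, x') over samples given as (d_squared, sigma)."""
    return sum(
        math.log(rating_probs(d2, theta1, theta2)[sigma - 1])
        for d2, sigma in samples
    )
```

Because rating 1 denotes dissimilarity, the probability of rating 1 grows with the distance, while the probability of rating 3 shrinks.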

3.2 Convex Optimization 

Another approach, proposed by Xing et al. (2002), computes $r$ by solving a convex optimization problem:

$$\begin{aligned}
\min_{r} \quad & \sum_{(x, x', \sigma = 3) \in \mathcal{S}} d_r^2(x, x') \\
\text{s.t.} \quad & \sum_{(x, x', \sigma = 1) \in \mathcal{S}} d_r(x, x') \ge 1, \\
& r \ge 0.
\end{aligned}$$

This formulation results in a distance metric that aims to minimize the distances between similar objects
while keeping dissimilar ones sufficiently far apart. As with ordinal regression, this algorithm only
makes use of the rating data $\mathcal{S}$.
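The sketch below evaluates the objective and the constraint of this formulation for a candidate coefficient vector. It does not solve the problem, since that requires a convex solver, which is outside the scope of this illustration. The representation of the rating data as `(x, x_prime, sigma)` tuples and all function names are assumptions.

```python
import math


def squared_distance(r, x, x_prime):
    """d_r^2(x, x') = sum_k r_k * (x_k - x'_k)^2."""
    return sum(rk * (a - b) ** 2 for rk, a, b in zip(r, x, x_prime))


def co_objective(r, ratings):
    """Sum of squared distances over pairs rated similar (sigma == 3)."""
    return sum(squared_distance(r, x, xp) for x, xp, s in ratings if s == 3)


def co_constraint(r, ratings):
    """Sum of unsquared distances over pairs rated dissimilar (sigma == 1)."""
    return sum(
        math.sqrt(squared_distance(r, x, xp)) for x, xp, s in ratings if s == 1
    )


def is_feasible(r, ratings, tol=1e-9):
    """Check r >= 0 and that dissimilar pairs are, in total, at least 1 apart."""
    return all(rk >= 0 for rk in r) and co_constraint(r, ratings) >= 1 - tol
```

In a toy example where the similar pair differs only in the first feature and the dissimilar pair only in the second, placing all weight on the second feature achieves an objective of zero while remaining feasible.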

3.3 Neighborhood Component Analysis 

Neighborhood component analysis (NCA) learns a distance metric from class labels based on an assumption
that similar objects are more likely to share the same class than dissimilar ones. NCA employs a model in
which a feature vector $x$ is assigned class label $c$ with probability

$$P(c \mid x, \mathcal{G}) = \frac{\sum_{(x', c' = c) \in \mathcal{G}} \exp\left(-d_r^2(x', x)\right)}{\sum_{(x', c') \in \mathcal{G}} \exp\left(-d_r^2(x', x)\right)}. \qquad (2)$$

NCA computes coefficients that would lead to accurate classification of objects in the training set $\mathcal{G}$. We
define accuracy here in terms of log-likelihood. In particular, we consider an implementation that aims
to produce coefficients by maximizing the average leave-one-out log-likelihood. That is,

$$\max_{r} \sum_{(x, c) \in \mathcal{G}} \log P\left(c \mid x, \mathcal{G} \setminus (x, c)\right). \qquad (3)$$

This optimization problem is not convex, but in our experience a local optimum can be found efficiently via
projected gradient ascent. In many practical cases the number of training samples is not much larger than
the number of parameters $K$, and NCA consequently suffers from overfitting. Therefore, we consider $L_1$
regularization in our application of NCA. In particular, we subtract a penalty term $\lambda \|r\|_1$ from the objective in (3), where
the parameter $\lambda$ is selected by cross-validation. Further details about our implementation can be found in
the appendix.
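A minimal sketch of the regularized leave-one-out objective follows. It evaluates the objective for a given $r$ but does not perform the projected gradient ascent. The data layout (a list of `(x, c)` pairs) and the function names are assumptions, and the sketch presumes that every class has at least two examples so that no leave-one-out probability is zero.

```python
import math


def squared_distance(r, x, x_prime):
    """d_r^2(x, x') = sum_k r_k * (x_k - x'_k)^2."""
    return sum(rk * (a - b) ** 2 for rk, a, b in zip(r, x, x_prime))


def class_prob(r, x, c, data):
    """P(c | x, data): kernel weight of class c divided by total kernel weight."""
    weights = [(math.exp(-squared_distance(r, xp, x)), cp) for xp, cp in data]
    total = sum(w for w, _ in weights)
    return sum(w for w, cp in weights if cp == c) / total


def loo_objective(r, data, lam=0.0):
    """Leave-one-out log-likelihood minus the L1 penalty lam * ||r||_1."""
    log_lik = 0.0
    for i, (x, c) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out the i-th labeled point
        log_lik += math.log(class_prob(r, x, c, rest))
    return log_lik - lam * sum(abs(rk) for rk in r)
```

With $r = 0$ every point receives equal weight, so the predicted class probabilities reduce to class frequencies among the remaining points; informative coefficients should improve on that baseline.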

3.4 A Hybrid Method 

We now introduce a hybrid method that simultaneously makes use of similarity ratings and class labels. Our 
approach is motivated by an assumption that similarity ratings are driven by a weighted Euclidean norm 
distance metric, but that the observed feature vectors may not express all relevant information about objects 
being compared. In particular, there may be "missing features" that influence the underlying distance metric.
Given objects $o$ and $o'$ with observed feature vectors $x, x' \in \mathbb{R}^K$ and missing feature vectors $z, z' \in \mathbb{R}^J$, we
assume the underlying distance metric is given by

$$D(o, o') = \left( \sum_{k=1}^{K} r_k (x_k - x'_k)^2 + \sum_{j=1}^{J} r^{\perp}_j (z_j - z'_j)^2 \right)^{1/2} = \left( d_r^2(x, x') + d_{r^{\perp}}^2(z, z') \right)^{1/2},$$

where $r \in \mathbb{R}_+^K$ and $r^{\perp} \in \mathbb{R}_+^J$.

Another important assumption we make concerning the missing feature vector is that it is conditionally
independent of the observed feature vector given the class label. In other
words, given an object with observed and missing feature vectors $x$ and $z$ and a class label $c$, we have
$p(x, z \mid c) = p(x \mid c)\, p(z \mid c)$. This assumption is justifiable since, if there exists any correlation between $x$ and $z$,
then we can subtract this dependence from $z$, resulting in another random variable $z'$, and replace $z$ by $z'$
without loss of generality.

Now suppose we are given a learning algorithm $\mathcal{A}$ that learns the conditional class probabilities $P(c \mid x)$
from class data $\mathcal{G}$. In other words, $\mathcal{A}$ is a function that maps $\mathcal{G}$ to an estimate $\hat{P}(\cdot \mid \cdot)$. Using these conditional
class probabilities $\hat{P}$, we generate a soft class label for each object represented in $\mathcal{S}$, our similarity
ratings data set, that is not labeled in the class data set $\mathcal{G}$. In particular, for an unlabeled object $o$ with
feature vector $x$, we generate a vector $u(o) \in \mathbb{R}^M$ whose $m$th component is $u_m(o) = \hat{P}(m \mid x)$. For
uniformity of notation, we also define a vector $u(o)$ for each object $o$ from $\mathcal{G}$, the set with class labels. In
this case, if $c$ is the class label assigned to $o$, then $u_c(o) = 1$ and $u_m(o) = 0$ for $m \ne c$.
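The construction of $u(o)$ can be sketched as follows: labeled objects receive a one-hot vector, and unlabeled objects receive the soft classifier's output. The names `class_vector`, `labels`, and `soft_classifier` are hypothetical, and the soft classifier is assumed to return a list of $M$ probabilities for a feature vector.

```python
def class_vector(obj_id, x, labels, soft_classifier, num_classes):
    """Return u(o) as a list of length M (classes are numbered 1..M).

    labels maps object identifiers to known classes; soft_classifier maps a
    feature vector to a list of M class probabilities.
    """
    if obj_id in labels:
        c = labels[obj_id]
        # Labeled object: u_c(o) = 1 and u_m(o) = 0 for m != c.
        return [1.0 if m == c else 0.0 for m in range(1, num_classes + 1)]
    # Unlabeled object: u_m(o) is the estimated probability of class m given x.
    probs = list(soft_classifier(x))
    if len(probs) != num_classes:
        raise ValueError("soft_classifier must return one probability per class")
    return probs
```

Keeping object identifiers in the data is what makes the `obj_id in labels` lookup possible, so that a known label always takes precedence over the classifier's estimate.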

We now discuss how the similarity ratings data $\mathcal{S}$ is used together with these class probability vectors to
produce a distance metric. The main idea is to generate an estimate of $\left(E[D^2(o, o') \mid x, x', u(o), u(o')]\right)^{1/2}$ that
is consistent with observed similarity ratings. The conditioning on $u(o)$ and $u(o')$ here indicates that these
vectors are taken to be the class probabilities associated with the two objects.
Note that

$$E[D^2(o, o') \mid x, x', u(o), u(o')] = d_r^2(x, x') + E[d_{r^{\perp}}^2(z, z') \mid x, x', u(o), u(o')],$$

and using the conditional independence assumption we have

$$\begin{aligned}
E[d_{r^{\perp}}^2(z, z') \mid x, x', u(o), u(o')]
&= \sum_{c=1}^{M} \sum_{c'=1}^{M} E[d_{r^{\perp}}^2(z, z') \mid x, x', c, c']\, u_c(o)\, u_{c'}(o') \\
&= \sum_{c=1}^{M} \sum_{c'=1}^{M} E[d_{r^{\perp}}^2(z, z') \mid c, c']\, u_c(o)\, u_{c'}(o') \\
&= u(o)^\top Q\, u(o'),
\end{aligned}$$

where $Q \in \mathbb{R}^{M \times M}$ is defined as

$$Q_{c, c'} = E[d_{r^{\perp}}^2(z, z') \mid c, c'], \qquad 1 \le c, c' \le M.$$

We can view $Q$ as a matrix that encodes distance information relating to missing features. This motivates
the following parameterization of a distance metric, which is what we will use:

$$d_{r,Q}(o, o') = \left(E[D^2(o, o') \mid x, x', u(o), u(o')]\right)^{1/2} = \left(d_r^2(x, x') + u(o)^\top Q\, u(o')\right)^{1/2}.$$

Note that in the event that class labels are not provided for $o$ and $o'$, the class probability vectors depend
only on $x$ and $x'$. Therefore, with some abuse of notation, when there are no class labels, we can write the
distance metric as

$$d_{r,Q}(x, x') = \left(d_r^2(x, x') + u(x)^\top Q\, u(x')\right)^{1/2}.$$
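The parameterized metric $d_{r,Q}$ can be sketched directly from its definition. In this illustration, `Q` is assumed to be a nested list with nonnegative entries, classes are indexed from 0 for convenience, and the function name `hybrid_distance` is hypothetical.

```python
import math


def hybrid_distance(r, Q, x, x_prime, u, u_prime):
    """d_{r,Q} = sqrt(d_r^2(x, x') + u^T Q u').

    r: feature coefficients; Q: M-by-M nested list; u and u_prime: class
    probability vectors of the two objects (length M each).
    """
    feature_term = sum(rk * (a - b) ** 2 for rk, a, b in zip(r, x, x_prime))
    class_term = sum(
        u[c] * Q[c][cp] * u_prime[cp]
        for c in range(len(u))
        for cp in range(len(u_prime))
    )
    return math.sqrt(feature_term + class_term)
```

When both objects have known labels, the class term simply picks out the single entry of $Q$ for that pair of classes, so $Q$ acts as a table of extra squared distance between classes.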

Our hybrid method estimates the vector $r \in \mathbb{R}^K$ and matrix $Q \in \mathbb{R}^{M \times M}$ so that they are consistent
with similarity ratings. To do so, it makes use of a similarity-based learning algorithm $\mathcal{B}$ that learns the
coefficients of a distance metric from feature differences and similarity ratings, such as the ordinal regression
or convex optimization methods we have described.

To provide a concrete version of our hybrid method, we consider the case where $\mathcal{A}$ is a kernel density
estimation procedure similar to NCA and $\mathcal{B}$ is the algorithm based on convex optimization discussed in
Section 3.2. In this case, the method first generates a feature vector density for each class according to

$$\hat{p}(x \mid c) = \frac{1}{\left|\{(x', c' = c) \in \mathcal{G}\}\right|} \sum_{(x', c' = c) \in \mathcal{G}} \mathcal{N}_w(x - x'),$$

where $\mathcal{N}_w$ is a Gaussian kernel, defined by

$$\mathcal{N}_w(x) \propto \exp\left(-\sum_{k=1}^{K} w_k x_k^2\right).$$

To produce conditional class probabilities, we estimate the marginal distribution of classes $\hat{P}(c)$ from the
class label data and apply Bayes' rule to arrive at

$$\hat{P}(c \mid x) = \frac{\hat{P}(c)\, \hat{p}(x \mid c)}{\sum_{m=1}^{M} \hat{P}(m)\, \hat{p}(x \mid m)}.$$
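The following sketch computes the conditional class probabilities from the kernel density estimates via Bayes' rule. The marginal class distribution is taken here to be the empirical class frequency in the label data, which is an assumption of this sketch. Because the same kernel parameters $w$ are shared by all classes, the kernel's normalizing constant cancels in the ratio and is omitted. Function names are hypothetical.

```python
import math


def gaussian_kernel(w, x):
    """Unnormalized Gaussian kernel exp(-sum_k w_k * x_k^2)."""
    return math.exp(-sum(wk * xk ** 2 for wk, xk in zip(w, x)))


def class_posterior(w, x, data, num_classes):
    """Return [P(1 | x), ..., P(M | x)] from labeled data [(x', c'), ...].

    Assumption: the class prior is the empirical class frequency in data.
    """
    counts = [0] * num_classes
    kernel_sums = [0.0] * num_classes
    for xp, c in data:
        counts[c - 1] += 1
        kernel_sums[c - 1] += gaussian_kernel(w, [a - b for a, b in zip(x, xp)])
    joint = []
    for m in range(num_classes):
        if counts[m] == 0:
            joint.append(0.0)  # a class with no examples gets zero probability
        else:
            prior = counts[m] / len(data)
            density = kernel_sums[m] / counts[m]  # average kernel value in class
            joint.append(prior * density)
    total = sum(joint)
    return [j / total for j in joint]
```

If $x$ is far from every labeled point, all kernel values can underflow to zero and the final division fails; a practical implementation would work with log-weights instead.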

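The Gaussian kernel parameters $w$ can be estimated by an approach similar to that described in (3). Then, to
compute estimates $\hat{r}$ and $\hat{Q}$, we solve the following convex optimization problem:

$$\begin{aligned}
\min_{r, Q} \quad & \sum_{(o, o', \sigma = 3) \in \mathcal{S}} \left( d_r(x, x')^2 + u(o)^\top Q\, u(o') \right) \\
\text{s.t.} \quad & \sum_{(o, o', \sigma = 1) \in \mathcal{S}} \sqrt{d_r(x, x')^2 + u(o)^\top Q\, u(o')} \ge 1, \\
& r \ge 0, \\
& Q \ge 0 \text{ and symmetric.}
\end{aligned}$$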

This is the hybrid method we use in our experiments. Note that we only require $Q$ to be element-wise
nonnegative, not positive semidefinite, and as such our method does not entail solving a semidefinite program (SDP).



4 Experiments 

We evaluate the aforementioned four algorithms, namely ordinal regression (OR), convex optimization (CO), 
neighborhood component analysis (NCA), and the hybrid method (HYB), in two experiments. In the first 
experiment, we generate 100 synthetic data sets by a sampling process. For the second experiment, a real 
data set consisting of feature vectors derived from computed tomography (CT) scans of liver lesions, along 
with diagnoses and comparison ratings provided by radiologists, is considered. The data was collected as 
part of a project that seeks to develop a similarity-based image retrieval system for radiological decision
support (Napel et al., 2010). We now describe the settings and empirical results of both experiments in
detail.

It is worth mentioning that, relative to the other algorithms we consider, the hybrid method increases the
number of free variables by $M(M + 1)/2$, which is the number of numerical values used to represent the
symmetric matrix $Q$. Since the number of classes $M$ is usually much smaller than the number of features
$K$, we do not expect this increase in degrees of freedom to drive differences in empirical results. For instance,
in the medical image dataset we study, we have $K = 60$ and $M = 3$, so our hybrid method only introduces
6 new variables to the 60 variables used by other methods.
4.1 Synthetic Data

The following procedure explains how we generate and conduct experiments with synthetic data:

1. Sample a generative model and coefficient vectors $r$ and $r^{\perp}$. Further details about this sampling process
can be found in the appendix.

2. Generate 200 data points from the resulting generative model; denote them by the set
$\mathcal{O} = \{(o^{(n)}, x^{(n)}, z^{(n)}, c^{(n)}) : n = 1, 2, \ldots, 200\}$.

3. For each integer pair $(a, b)$, $1 \le a, b \le 200$, $a \ne b$, let

$$y^{(a,b)} = \sum_{k=1}^{K} r_k \left(x_k^{(a)} - x_k^{(b)}\right)^2 + \sum_{j=1}^{J} r^{\perp}_j \left(z_j^{(a)} - z_j^{(b)}\right)^2 + \epsilon^{(a,b)},$$

where $\epsilon^{(a,b)}$ is sampled iid from $\mathcal{N}(0, 50^2)$ to represent the random noise in rating. This results in
39,800 distance values. Let $y_{20\%}$ be their first quintile and $y_{50\%}$ be their median. We set

$$\sigma^{(a,b)} = \begin{cases}
3 & \text{if } y^{(a,b)} \le y_{20\%}, \\
2 & \text{if } y_{20\%} < y^{(a,b)} \le y_{50\%}, \\
1 & \text{otherwise.}
\end{cases}$$

4. Let $\mathcal{X} = \{(o^{(i)}, x^{(i)}) : 1 \le i \le 100\}$ be the training set and $\bar{\mathcal{X}} = \{(o^{(i)}, x^{(i)}) : 101 \le i \le 200\}$ be the
testing set. Take $\mathcal{G} = \{(o^{(i)}, x^{(i)}, c^{(i)}) : 1 \le i \le 100\}$ to be the label data set.

5. Let $\mathcal{S} = \{(o^{(i)}, o^{(j)}, x^{(i)}, x^{(j)}, \sigma^{(i,j)}) : 1 \le i, j \le 100,\ i \ne j\}$ and
$\bar{\mathcal{S}} = \{(o^{(i)}, o^{(j)}, x^{(i)}, x^{(j)}, \sigma^{(i,j)}) : 1 \le j \le 100 < i \le 200\}$. $\bar{\mathcal{S}}$ will be used for testing, and for training we sample 5 subsets of $\mathcal{S}$, namely
$\mathcal{S}_1, \ldots, \mathcal{S}_5$, such that the sizes of these sets are equal to 5%, 7.5%, 10%, 12.5% and 15% of the size of $\mathcal{S}$,
respectively. The reason for using $\mathcal{S}_1, \ldots, \mathcal{S}_5$ as our training sets is that in many practical contexts it
is not feasible to gather an exhaustive set of comparison data that rates all pairs of feature vectors as
does $\mathcal{S}$.

6. For $l = 1, 2, \ldots, 5$, run OR, CO, NCA, and HYB on the datasets $(\mathcal{X}, \mathcal{G}, \mathcal{S}_l)$, resulting in four distance
measures. Then for every $x^{(n)} \in \bar{\mathcal{X}}$, apply each distance measure to retrieve the top 10 closest objects in
$\mathcal{X}$, and evaluate the retrieved list by normalized discounted cumulative gain at position 10 (NDCG$_{10}$),
defined as

$$\mathrm{NDCG}_{10} = \frac{\mathrm{DCG}_{10}}{\text{Ideal DCG}_{10}}, \qquad
\mathrm{DCG}_{10} = \sum_{p=1}^{10} \frac{2^{\sigma^{(n, i_p)}} - 1}{\log_2(1 + p)}, \qquad
\text{Ideal DCG}_{10} = \sum_{p=1}^{10} \frac{2^{\sigma^{(n, i^*_p)}} - 1}{\log_2(1 + p)},$$

where $i_p$ is the $p$th most similar object to $x^{(n)}$ based on the distance measure under test and $i^*_p$ is the $p$th
most similar object based on the ratings in $\bar{\mathcal{S}}$. We use NDCG$_{10}$ as our evaluation criterion since it is
the most commonly used one when assessing relevance.

The above procedure was repeated 100 times, resulting in 100 different generative models and data sets.
Figure 1 plots the average NDCG$_{10}$ delivered by OR, CO, NCA, and HYB. The advantage of HYB becomes
significant as the size of the rating data set grows.
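The NDCG$_{10}$ criterion used in Step 6 can be sketched as follows. The callables `distance` and `rating` and the function names are hypothetical stand-ins for a learned distance measure and a lookup of the test ratings; ties in either ordering are broken by the candidates' original order, which is an arbitrary choice of this sketch.

```python
import math


def dcg(ratings_in_order, k=10):
    """Discounted cumulative gain of the first k ratings in a ranked list."""
    return sum(
        (2 ** sigma - 1) / math.log2(1 + p)
        for p, sigma in enumerate(ratings_in_order[:k], start=1)
    )


def ndcg(query, candidates, distance, rating, k=10):
    """NDCG at position k for one query.

    distance(query, c) ranks candidates from closest to farthest;
    rating(query, c) returns the similarity rating sigma in {1, 2, 3}.
    """
    by_distance = sorted(candidates, key=lambda c: distance(query, c))
    by_rating = sorted(candidates, key=lambda c: rating(query, c), reverse=True)
    ideal = dcg([rating(query, c) for c in by_rating], k)
    return dcg([rating(query, c) for c in by_distance], k) / ideal
```

Since every rating is at least 1, each retrieved item contributes a positive gain, so the ideal DCG is never zero and the ratio is well defined.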
[Plot: NDCG$_{10}$ versus size of rating data set (horizontal axis from 400 to 1600).]

Figure 1: The average NDCG$_{10}$ delivered by OR, CO, NCA, and HYB, over different sizes of rating data
set. For statistical interpretation, we also give the error bars (one standard deviation) in the plots.



4.2 Real Data 

Our real data set consists of thirty medical images, each corresponding to a distinct CT scan. Features
of each image included semantic annotations given by a radiologist (Rubin et al., 2008) using a controlled
vocabulary and quantitative features such as lesion border sharpness, histogram statistics (Bilello et al.,
2004; Rubin et al., 2008), Haar wavelets (Strela et al., 1999), and Gabor textures (Zhao et al., 2004). A
total of 479 features were extracted from each image, many of which are linearly dependent. To simplify the
computation, we removed those features whose correlations are above 0.95, and normalized the remaining
ones. This resulted in 60 features, which we used in our study.

For each pair among the thirty CT scans, we collected two ratings of image similarity from two different
radiologists. Each image was classified with one of three diagnoses: cyst, metastasis, or hemangioma. Figure
2 shows some sample images in our data set.



Figure 2: Sample images in our data set. Each row of the images corresponds to diagnosis cyst, metastasis,
and hemangioma, respectively. The red circles in each image are annotated by a radiologist to indicate the
regions of interest.
To connect the aforementioned quantities to the notation we have introduced, note that the number of
features is $K = 60$, and the number of classes is $M = 3$. Denote the set of image-feature pairs by
$\mathcal{X} = \{(o^{(i)}, x^{(i)}) : 1 \le i \le 30\}$, the class label data by $\mathcal{G} = \{(o^{(i)}, x^{(i)}, c^{(i)}) : 1 \le i \le 30\}$, and the similarity rating
data by $\mathcal{S} = \{(o^{(i)}, o^{(j)}, x^{(i)}, x^{(j)}, \sigma^{(i,j)}) : 1 \le i, j \le 30,\ i \ne j\}$. Tables 1 and 2 provide the frequencies with which
different ratings and classes appear in the data set.



Table 1: The distribution of ratings.

Rating            Frequency
1 (Dissimilar)    58.6%
2 (Neutral)       16.2%
3 (Similar)       25.2%

Table 2: The distribution of classes.

Class         Frequency
Cyst          44%
Metastasis    33%
Hemangioma    23%



Since the data points are not very abundant in this case, we use leave-one-out cross-validation to evaluate 
the performance. More specifically, for n = 1, 2, . . . , 30, we do the following: 

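1. Let $\mathcal{X}_{-n} = \mathcal{X} \setminus (o^{(n)}, x^{(n)})$.

2. Let $\mathcal{G}_{-n} = \mathcal{G} \setminus (o^{(n)}, x^{(n)}, c^{(n)})$.

3. Let $\mathcal{S}_{-n} = \mathcal{S} \setminus \{(o^{(i)}, o^{(j)}, x^{(i)}, x^{(j)}, \sigma^{(i,j)}) : i = n \text{ or } j = n\}$.

4. Apply the four methods OR, CO, NCA, and HYB to $(\mathcal{X}_{-n}, \mathcal{G}_{-n}, \mathcal{S}_{-n})$.

5. Use each of the resulting distance measures to retrieve the top 10 images from $\mathcal{X}_{-n}$ that are closest to
$x^{(n)}$.

6. Evaluate the NDCG$_{10}$ of the retrieved lists.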
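Steps 1 through 3 of this leave-one-out procedure can be sketched as below. The dictionary-based data layout and the function name are assumptions. The fourth return value, which collects the ratings involving the held-out image, is an addition of this sketch for use when scoring the retrieved lists.

```python
def leave_one_out(n, features, labels, ratings):
    """Remove object n from the feature, label, and rating data.

    features: dict id -> feature vector; labels: dict id -> class;
    ratings: dict (i, j) -> similarity rating sigma.
    """
    X = {o: x for o, x in features.items() if o != n}
    G = {o: c for o, c in labels.items() if o != n}
    # Drop every rating in which the held-out object appears on either side.
    S = {(i, j): s for (i, j), s in ratings.items() if i != n and j != n}
    held_out = {(i, j): s for (i, j), s in ratings.items() if i == n or j == n}
    return X, G, S, held_out
```

Removing the ratings that involve the held-out image ensures that no method sees any similarity judgment about the query during training.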

Figure 3 plots the average NDCG$_{10}$ delivered by OR, CO, NCA, and HYB. As we can see, HYB leads
the other methods by a significant margin of more than 8 percent (0.75 vs. NCA's 0.67).



5 Conclusion 



We have presented a hybrid method that learns a distance measure by fusing similarity ratings and class 
labels. This approach consists of two elements, including an algorithm that learns the class probability 
conditioned on feature through label data, and another algorithm that fits model coefficients so that the 
resulting distance measure is consistent with similarity ratings. In our implementation, NCA and CO are 
chosen for these two elements, respectively. We tried the algorithm on synthetic data as well as a data set 
collected for the purpose of developing a medical image retrieval system, and demonstrated that it provides 
substantial gains over various methods that learn distance metrics exclusively from class or similarity data. 

As a parting thought, it is worth mentioning that our hybrid method combines elements of generative
and discriminative learning. There has been a growing literature that explores such combinations
(Jaakkola and Haussler, 1998; Raina et al.; Kao et al., 2009), and it would be interesting to explore the
relationship of our hybrid method to other work on this broad topic.
Figure 3: The average NDCG$_{10}$ delivered by OR, CO, NCA, and HYB for the medical image data set. For
statistical interpretation, we also give the error bars (one standard deviation) in the plots.

Appendix: Implementation Details 

$L_1$-regularized NCA

In our implementation, we randomly partition the class label data set $\mathcal{G}$ into a training set $\mathcal{G}_t$ and a validation
set $\mathcal{G}_v$, whose sizes are roughly 70% and 30% of $\mathcal{G}$, respectively. For each $\lambda \in \{1, 2, 4, 8, 16\}$, we solve

$$\max_{r} \sum_{(x, c) \in \mathcal{G}_t} \log P\left(c \mid x, \mathcal{G}_t \setminus (x, c)\right) - \lambda \|r\|_1$$

by projected gradient ascent. We then compute the log-likelihood of the validation set, given by

$$\sum_{(x, c) \in \mathcal{G}_v} \log P(c \mid x, \mathcal{G}_t),$$

and select the value of $\lambda$ that results in the highest log-likelihood. The resulting value of $\lambda$ is subsequently
applied as the regularization parameter when we solve for $r$ with the complete training set $\mathcal{G}$. The range of
$\lambda$ is determined through trial and error and chosen so that in our experiments the optima rarely took on
extreme values.
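The selection of $\lambda$ described above can be sketched as a generic hold-out loop. The callables `fit` and `validation_ll` are hypothetical placeholders for the regularized NCA solver and the validation log-likelihood, respectively; the fixed random seed and the rounding of the split size are assumptions of this sketch.

```python
import random


def select_lambda(data, fit, validation_ll,
                  lambdas=(1, 2, 4, 8, 16), train_frac=0.7, seed=0):
    """Pick the regularization weight by hold-out validation, then refit.

    fit(train, lam) returns fitted coefficients; validation_ll(model, train,
    val) returns the validation log-likelihood of those coefficients.
    """
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_frac * len(shuffled))  # roughly 70% for training
    train, val = shuffled[:cut], shuffled[cut:]
    best = max(lambdas, key=lambda lam: validation_ll(fit(train, lam), train, val))
    # Refit on the complete label data set with the selected lambda.
    return best, fit(list(data), best)
```

A single random split is used here, as in the description above; averaging over several splits would reduce the variance of the selected value at additional computational cost.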

Sampling Generative Model 

We take $K = 20$, $J = 20$, and $M = 3$ for the synthetic data experiment. Algorithm 1 is the procedure we
use to sample the generative models. Here we set $p(x \mid c)$ and $p(z \mid c)$ as mixtures of Gaussian distributions.
This procedure was repeated 100 times to produce 100 generative models.